I am very interested in visualizations, especially geospatial (geographic) visualization! It is all the better if the visualization is interactive.
For my Data Incubator fellowship application challenge I tried to create a spatial animation of daily confirmed COVID-19 cases and deaths in Indiana, as reported by Johns Hopkins University’s interactive website (Dong, Du, and Gardner 2020).
One of my persistent annoyances (and frustrations) is the basic incompatibility between how commonly available spatial data is organized and how it needs to be organized for analytics. The central issue is this: most spatial data is stored in wide1 format, whereas most analytics software needs the data in long format. And that is not to mention all the minor details (esoteric spatial file formats, missing/invalid data, R’s automatic type conversion, etc.) that must be handled to successfully join spatial data with attribute data before you can do any analytics. Lest you think I have been living under a rock: I am quite aware, from my own past experience and from reading articles and blogs similar to this NYT article, that data munging can take surprisingly more time than anticipated. But against my better judgement I thought: how hard can this really be? Right away you know that you are reading the words of a novice data scientist!
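To make the wide/long distinction concrete, here is a toy sketch (the county codes and counts are made up, not the JHU schema): the same counts can live one-column-per-date (wide) or one-row-per-date (long), and `data.table`’s `melt()` converts the former into the latter.

```r
library("data.table")

## wide: one row per county, one column per date
wide <- data.table(fips = c("18001", "18003"),
                   `6/26/20` = c(10L, 250L),
                   `6/27/20` = c(12L, 260L))

## long: one row per county-date pair
long <- melt(wide, id.vars = "fips",
             variable.name = "date", value.name = "cases")
```

In the long form, a date-range filter is just a row subset; in the wide form it is a column-selection puzzle.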
There are two data components needed to achieve my objective: 1) COVID-19 time series data, and 2) Geographic boundaries data.
## load the libraries to get started...
library("data.table")
library("sf")
library("ggplot2")
Johns Hopkins University (JHU) provides the data in CSV format at https://github.com/CSSEGISandData/COVID-19. Exploring the folders, I surmised that I needed to get the time series data. So, I downloaded the CSV files time_series_covid19_confirmed_US.csv and time_series_covid19_deaths_US.csv. Some of you might be wondering: why the heck didn’t you just clone the repository? to which I reply: sorry, I was being naive.2
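If you, too, would rather fetch just the two files than clone the repository, something like the following works (the raw.githubusercontent.com path below reflects the repository layout at the time of writing and may change):

```r
## fetch the two time series CSVs straight from the JHU repository
base <- paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/",
               "master/csse_covid_19_data/csse_covid_19_time_series/")
for (f in c("time_series_covid19_confirmed_US.csv",
            "time_series_covid19_deaths_US.csv")) {
  download.file(paste0(base, f), destfile = f, mode = "wb")
}
```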
Let’s read the data and see what we have.
confirmed <- fread("time_series_covid19_confirmed_US.csv", colClasses="character")
dim(confirmed)
## [1] 3261 172
confirmed[1:7, 1:15]
## UID iso2 iso3 code3 FIPS Admin2 Province_State
## 1: 16 AS ASM 16 60.0 American Samoa
## 2: 316 GU GUM 316 66.0 Guam
## 3: 580 MP MNP 580 69.0 Northern Mariana Islands
## 4: 630 PR PRI 630 72.0 Puerto Rico
## 5: 850 VI VIR 850 78.0 Virgin Islands
## 6: 84001001 US USA 840 1001.0 Autauga Alabama
## 7: 84001003 US USA 840 1003.0 Baldwin Alabama
## Country_Region Lat Long_ Combined_Key 1/22/20
## 1: US -14.271 -170.132 American Samoa, US 0
## 2: US 13.4443 144.7937 Guam, US 0
## 3: US 15.0979 145.6739 Northern Mariana Islands, US 0
## 4: US 18.2208 -66.5901 Puerto Rico, US 0
## 5: US 18.3358 -64.8963 Virgin Islands, US 0
## 6: US 32.53952745 -86.64408227 Autauga, Alabama, US 0
## 7: US 30.72774991 -87.72207058 Baldwin, Alabama, US 0
## 1/23/20 1/24/20 1/25/20
## 1: 0 0 0
## 2: 0 0 0
## 3: 0 0 0
## 4: 0 0 0
## 5: 0 0 0
## 6: 0 0 0
## 7: 0 0 0
head(colnames(confirmed), 20)
## [1] "UID" "iso2" "iso3" "code3"
## [5] "FIPS" "Admin2" "Province_State" "Country_Region"
## [9] "Lat" "Long_" "Combined_Key" "1/22/20"
## [13] "1/23/20" "1/24/20" "1/25/20" "1/26/20"
## [17] "1/27/20" "1/28/20" "1/29/20" "1/30/20"
deaths <- fread("time_series_covid19_deaths_US.csv", colClasses="character")
dim(deaths)
## [1] 3261 173
deaths[1:7, 1:15]
## UID iso2 iso3 code3 FIPS Admin2 Province_State
## 1: 16 AS ASM 16 60 American Samoa
## 2: 316 GU GUM 316 66 Guam
## 3: 580 MP MNP 580 69 Northern Mariana Islands
## 4: 630 PR PRI 630 72 Puerto Rico
## 5: 850 VI VIR 850 78 Virgin Islands
## 6: 84001001 US USA 840 01001 Autauga Alabama
## 7: 84001003 US USA 840 01003 Baldwin Alabama
## Country_Region Lat Long_ Combined_Key
## 1: US -14.271 -170.132 American Samoa, US
## 2: US 13.4443 144.7937 Guam, US
## 3: US 15.0979 145.6739 Northern Mariana Islands, US
## 4: US 18.2208 -66.5901 Puerto Rico, US
## 5: US 18.3358 -64.8963 Virgin Islands, US
## 6: US 32.53952745 -86.64408227 Autauga, Alabama, US
## 7: US 30.72774991 -87.72207058 Baldwin, Alabama, US
## Population 1/22/20 1/23/20 1/24/20
## 1: 55641 0 0 0
## 2: 164229 0 0 0
## 3: 55144 0 0 0
## 4: 2933408 0 0 0
## 5: 107268 0 0 0
## 6: 55869 0 0 0
## 7: 223234 0 0 0
head(colnames(deaths), 20)
## [1] "UID" "iso2" "iso3" "code3"
## [5] "FIPS" "Admin2" "Province_State" "Country_Region"
## [9] "Lat" "Long_" "Combined_Key" "Population"
## [13] "1/22/20" "1/23/20" "1/24/20" "1/25/20"
## [17] "1/26/20" "1/27/20" "1/28/20" "1/29/20"
As is evident, this data is organized in a wide format: rows correspond to US counties, and the columns comprise county attributes plus one column per date, starting 2020-01-22, when JHU began tracking the data. Looking at the dimensions, especially the number of columns, I clearly needed to reshape the dataset before joining it with the spatial data.
The FIPS clean-up below is needed because time_series_covid19_confirmed_US.csv stores the FIPS column as a number with a trailing .0 (see row 6 above)! In fact, I spent more than four hours chasing data type issues3 before finally deciding to read all columns as character vectors and do the necessary conversions myself. I began by fixing FIPS, which is the key column needed for joining the spatial data.
## parse away the trailing ".0", then left-pad 4-digit county codes to 5
confirmed[, FIPS := as.character(as.integer(FIPS))]
confirmed[nchar(FIPS) == 4, FIPS := sprintf("0%s", FIPS)]
deaths[, FIPS := as.character(as.integer(FIPS))]
deaths[nchar(FIPS) == 4, FIPS := sprintf("0%s", FIPS)]
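An equivalent one-liner uses formatC with zero padding; note that, unlike the two-step version, it pads everything to five digits, including the short territory codes (60, 66, …), which do not match any county GEOID anyway:

```r
## parse, then left-pad with zeros to width 5 in a single step
confirmed[, FIPS := formatC(as.integer(FIPS), width = 5, flag = "0")]
deaths[, FIPS := formatC(as.integer(FIPS), width = 5, flag = "0")]
```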
Anyone who has ever worked with any GIS software has inevitably had to deal with shapefiles. While they may have been acceptable when they were created, they are not suited to modern needs. The GIS community, at least the open-source geospatial community, recognizes these shortcomings and has come up with alternative formats such as GeoPackage and GeoJSON. But visit any federal, state, or local government website and you are very likely to find spatial data offered only as a shapefile, usually in a zip archive. There are many issues with shapefiles (just check out http://switchfromshapefile.org/)…but I know how to deal with all of them…just convert to a GeoPackage! So, I downloaded the US county shapefile (cb_2018_us_county_500k.zip)4 from the US Census Bureau’s Cartographic Boundary Files - Shapefile page and converted it to a GeoPackage.
## d <- st_read("cb_2018_us_county_500k.shp")
## st_write(d, "counties.gpkg")
## rm(d)
(counties_geo <- st_read("counties.gpkg", quiet=TRUE))
## Simple feature collection with 3233 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -179.1489 ymin: -14.5487 xmax: 179.7785 ymax: 71.36516
## geographic CRS: NAD83
## First 10 features:
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD ALAND
## 1 21 007 00516850 0500000US21007 21007 Ballard 06 639387454
## 2 21 017 00516855 0500000US21017 21017 Bourbon 06 750439351
## 3 21 031 00516862 0500000US21031 21031 Butler 06 1103571974
## 4 21 065 00516879 0500000US21065 21065 Estill 06 655509930
## 5 21 069 00516881 0500000US21069 21069 Fleming 06 902727151
## 6 21 093 00516893 0500000US21093 21093 Hardin 06 1614569777
## 7 21 099 00516896 0500000US21099 21099 Hart 06 1068530028
## 8 21 131 00516912 0500000US21131 21131 Leslie 06 1038206077
## 9 21 151 00516919 0500000US21151 21151 Madison 06 1132729653
## 10 21 155 00516921 0500000US21155 21155 Marion 06 888463701
## AWATER geom
## 1 69473325 MULTIPOLYGON (((-89.18137 3...
## 2 4829777 MULTIPOLYGON (((-84.44266 3...
## 3 13943044 MULTIPOLYGON (((-86.94486 3...
## 4 6516335 MULTIPOLYGON (((-84.12662 3...
## 5 7182793 MULTIPOLYGON (((-83.98428 3...
## 6 17463238 MULTIPOLYGON (((-86.27756 3...
## 7 13692536 MULTIPOLYGON (((-86.16112 3...
## 8 9189732 MULTIPOLYGON (((-83.5531 37...
## 9 15306635 MULTIPOLYGON (((-84.52564 3...
## 10 9891797 MULTIPOLYGON (((-85.52129 3...
Some things to notice:

- The JHU tables contain rows such as Grand Princess, Federal Corrections (MDOC), etc., which are irrelevant for my purposes.
- The GEOID column is the concatenation of the STATEFP and COUNTYFP columns.
- table(nchar(counties_geo$GEOID)) shows that all 3233 GEOIDs are 5 characters wide, which is why we had to ‘0’-prefix some of our FIPS rows earlier.
- The GEOID and FIPS columns can be used to join/relate the attribute and spatial tables.

While we can certainly use the tabular data with the geospatial data “as is”, there are a few things we may want to consider:
- Since this time series is updated daily, there will be more columns as newer data are added.
- Filtering, especially by date range, involves selecting different columns. For instance, exploring cases/deaths between April 1, 2020 and June 15, 2020 requires figuring out which columns to select; a little while later, exploring April 15, 2020 to June 30, 2020 means figuring out the columns all over again. Contrast that with a long format with three columns (one for date, one for confirmed cases, one for deaths), where selecting by date range is a simple matter of row filtering, doable with R’s built-in indexing or data munging packages (data.table, dplyr, tidyverse, reshape).
- Determining a date (or date range) from some combination of values (confirmed cases, deaths, or some transformation of them) is going to be rather tedious.
Hence, it would be good to convert this data to long format. By the way, this is the same way many analytics-oriented applications, R included, view data: instead of thinking of a table as a collection of rows, think of it as a collection of columns, each of which is a vector of values! This is why asking R for a data.frame’s length (for instance length(mtcars)) yields the number of columns (i.e., dim(mtcars)[[2L]], which is 11). This view of the data is called an “inverted table” (Hui and Kromberg 2020, 11) and is the predominant view in columnar databases and analytic processing systems! Wide-to-long conversion is commonly called a “melt” operation in R parlance, so let us melt our data.
deaths.id.vars <- 1:12 ## head(colnames(deaths), 14) ... just to verify
confirmed.id.vars <- 1:11 ## head(colnames(confirmed), 13)
deaths.long <- melt(deaths, id.vars=colnames(deaths)[deaths.id.vars],
measure.vars=colnames(deaths)[-deaths.id.vars],
variable.name="date1", value.name="deaths")
confirmed.long <- melt(confirmed, id.vars=colnames(confirmed)[confirmed.id.vars],
measure.vars=colnames(confirmed)[-confirmed.id.vars],
variable.name="date1", value.name="confirmed_cases")
stopifnot(nrow(deaths) * (ncol(deaths)-length(deaths.id.vars)) ==
nrow(deaths.long))
stopifnot(nrow(confirmed) * (ncol(confirmed)-length(confirmed.id.vars)) ==
nrow(confirmed.long))
## keep the original date1 column in case the date parsing messes up.
deaths.long[, `:=`(ddate1 = strftime(as.Date(as.character(date1), format="%m/%d/%y"),
format="%Y-%m-%d")
,deaths=as.integer(deaths) )]
confirmed.long[, `:=`(ddate1=strftime(as.Date(as.character(date1), format="%m/%d/%y"),
format="%Y-%m-%d")
,confirmed_cases = as.integer(confirmed_cases))]
We melt the data by using the first few columns as id variables and all the date columns as measure variables. It is a good idea to use stopifnot statements to ensure that the transformed data conforms to what we are expecting. Starting from R version 4.0, stopifnot accepts custom error messages via argument names, which makes argument checking easier.
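For example, since R 4.0 the argument name doubles as the error message, so the check above could be written in a self-describing way (a small sketch reusing the objects defined earlier):

```r
## the name of the argument becomes the error message on failure (R >= 4.0)
n_measure <- ncol(deaths) - length(deaths.id.vars)
stopifnot("melt should yield one row per county-date pair" =
            nrow(deaths.long) == nrow(deaths) * n_measure)
```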
Now that we have our data, let us try to plot it. While I wanted to create an animation for all US counties, I ran into lots of issues, so I decided to do a small example for Indiana counties instead.
IN_geo <- counties_geo[grepl("^18", counties_geo$GEOID),]
deaths_IN <- deaths.long[grepl("^18", FIPS),]
confirmed_IN <- confirmed.long[grepl("^18", FIPS), ]
## ddate1 is a character column (from strftime), so compare against a string
deaths_20200627 <- deaths_IN[ddate1=="2020-06-27", ]
cases_20200627 <- confirmed_IN[ddate1=="2020-06-27", ]
par(mfrow=c(1,2))
d <- merge(IN_geo, deaths_20200627, by.x="GEOID", by.y="FIPS")
plot(d[, "deaths"], main="Deaths 2020-06-27", breaks='kmeans')
d <- merge(IN_geo, cases_20200627, by.x="GEOID", by.y="FIPS")
plot(d[, "confirmed_cases"], main="Confirmed cases 2020-06-27", breaks='kmeans')
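One thing to keep in mind: merge() silently drops rows without a match, so it is worth verifying that no county went missing in the join. A minimal sanity check, assuming the objects above, could look like:

```r
## GEOIDs present in the geometry but absent from the attribute table
missing_fips <- setdiff(IN_geo$GEOID, deaths_20200627$FIPS)
stopifnot("every Indiana county should have a deaths record" =
            length(missing_fips) == 0L)
```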
Now that we know how to create a plot let’s try to create a whole bunch of images which we can combine into an animation.
outdir <- "figs"
dts <- unique(deaths_IN$ddate1)
## for(d in dts[155:160]) {
for(d in dts) {
deaths_png <- file.path(outdir, sprintf("deaths_%s.png", d))
cases_png <- file.path(outdir, sprintf("cases_%s.png", d))
dd <- deaths_IN[ddate1==d, .(FIPS, deaths, ddate1)]
cc <- confirmed_IN[ddate1==d,.(FIPS, confirmed_cases, ddate1)]
ii <- IN_geo[, c("NAME", "GEOID", "geom")]
dat <- merge(ii, cc, by.x="GEOID", by.y="FIPS")
## normalize to the day's share of the state total
dat$confirmed_cases <- dat$confirmed_cases/sum(dat$confirmed_cases)
if(!file.exists(cases_png)) {
png(cases_png)
## plot(dat[, "confirmed_cases"], main=d, key.pos=NULL, border=0L, breaks="kmeans")
plot(dat[, "confirmed_cases"], main=d, breaks="kmeans")
dev.off()
}
dat <- merge(ii, dd, by.x="GEOID", by.y="FIPS")
## normalize to the day's share of the state total
dat$deaths <- dat$deaths/sum(dat$deaths)
if(!file.exists(deaths_png)) {
png(deaths_png)
## plot(dat["deaths"], main=d, key.pos=NULL, border=0L, breaks="kmeans")
plot(dat["deaths"], main=d, breaks="kmeans")
dev.off()
}
}
And, finally, we can combine all of these PNGs into a GIF with ImageMagick, using something like convert figs/cases*.png cases.gif. Here is the very rudimentary result of my unsuccessful attempt.
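If you would rather stay in R, the magick package (an alternative I did not use, so treat this as a sketch) can assemble the same frames into a GIF:

```r
library("magick")

## read the per-day frames in date order and animate them
frames <- image_read(sort(list.files("figs", pattern = "^cases_.*\\.png$",
                                     full.names = TRUE)))
anim <- image_animate(frames, fps = 5)
image_write(anim, "cases.gif")
```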
Working with geospatial data is a lot trickier than anticipated. I tend to forget this fact quite often and have to make a conscious, and regular, effort to keep it in mind.
Use git and make more often so that I become familiar enough with these tools to use them unconsciously.
Try to always remember:
It always takes longer than you expect, even if you take Hofstadter’s Law into account. – Douglas Hofstadter
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.2 sf_1.0-0 data.table_1.12.8
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 compiler_4.0.2 pillar_1.4.4 class_7.3-17
## [5] tools_4.0.2 digest_0.6.25 evaluate_0.14 lifecycle_0.2.0
## [9] tibble_3.0.2 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.6
## [13] DBI_1.1.0 yaml_2.2.1 xfun_0.15 e1071_1.7-3
## [17] withr_2.2.0 dplyr_1.0.0 stringr_1.4.0 knitr_1.29
## [21] generics_0.0.2 vctrs_0.3.1 classInt_0.4-3 grid_4.0.2
## [25] tidyselect_1.1.0 glue_1.4.1 R6_2.4.1 rmarkdown_2.3
## [29] purrr_0.3.4 magrittr_1.5 scales_1.1.1 htmltools_0.5.0
## [33] ellipsis_0.3.1 units_0.6-7 colorspace_1.4-1 KernSmooth_2.23-17
## [37] stringi_1.4.6 munsell_0.5.0 crayon_1.3.4
Dong, Ensheng, Hongru Du, and Lauren Gardner. 2020. “An Interactive Web-Based Dashboard to Track COVID-19 in Real Time.” The Lancet Infectious Diseases 20 (5): 533–34. https://doi.org/10.1016/S1473-3099(20)30120-1.
Hui, Roger K. W., and Morten J. Kromberg. 2020. “APL Since 1978.” Proceedings of the ACM on Programming Languages 4 (HOPL): 1–108. https://doi.org/10.1145/3386319.
Firstly, I rejoice that I can even comprehend what you mean by this question. Secondly, as far as I’m aware, git is not an integral part of most GIS workflows (my own area of comfort). But, most important of all, I have not yet internalized, and integrated, git comprehensively into my own workflow!↩︎
I could never figure out whether it was R’s automatic data conversion or my heavy customization that was causing all these issues. I tried running R --vanilla which only increased my problems…sigh!↩︎
Since I was mostly interested in getting the data into R, it did not matter whether I chose shapefile or geodatabase. I would definitely have chosen geodatabase if the data size had exceeded the shapefile’s limitations.↩︎